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Abstract 

We present a new approach for the analysis of genome-wide expression data. Our 
method is designed to overcome the hmitations of traditional techniques, when ap- 
plied to large-scale data. Rather than alloting each gene to a single cluster, we assign 
both genes and conditions to context-dependent and potentially overlapping tran- 
scription modules. We provide a rigorous definition of a transcription module as the 
object to be retrieved from the expression data. An efficient algorithm, that searches 
for the modules encoded in the data by iteratively refining sets of genes and conditions 
until they match this definition, is established. Each iteration involves a linear map, 
induced by the normalized expression matrix, followed by the application of a thresh- 
old function. We argue that our method is in fact a generalization of Singular Value 
Decomposition, which corresponds to the special case where no threshold is applied. 
We show analytically that for noisy expression data our approach leads to better 
classification due to the implementation of the threshold. This result is confirmed 
by numerical analyses based on in-silico expression data. We discuss briefly results 
obtained by applying our algorithm to expression data from the yeast S. cerevisiae. 



* Correspondence should be addressed to: Nacmia.Barkai@weizmajin.ac.il 
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1 Introduction 



DNA microarray experiments monitor the expression levels of thousands of genes simul- 
taneously 0,0,^^. Using this technology, large sets of genome- wide expression data 
have been accumulated |^. For example, the expression levels of the entire yeast genome 
(comprising ~ 6200 genes) have been measured for more than 1000 different experimental 
conditions 0. A large number of DNA chip experiments have also been carried out for 
higher eukaryotes, such as the nematode C. elegans and the fruit fly Drosophila, as well as 
for a variety of both normal and malignant human tissues. 

While large scale expression data have the potential to reveal new insights into the 
transcriptional network that controls gene expression, they also give rise to a major com- 
putational challenge: How can one make sense of the massive expression data containing 
millions of numbers? The classification of the genes and the experimental conditions is an 
essential first step in reducing the complexity of such data. However, while standard tools, 
like clustering algorithms 0, ||, ||, 0, |13|, [l^ (see |]15|, |16| for reviews) and Singular 
Value Decomposition (SVD) |T^,0, provide interesting results when applied to relatively 
small data sets, typically containing tens of experimental conditions and at most several 
hundred genes, these methods are of limited use for the analysis of large data sets. In 
particular, a well-recognized drawback of commonly used clustering algorithms is the fact 
that they assign each gene to a single cluster, while in fact genes that participate in several 



functions should be included in multiple clusters |]19|,^^^. Moreover, both in stan- 
dard clustering methods and SVD, genes are analyzed based on their expression under all 
experimental conditions. This is problematic, since cellular processes are usually affected 
only by a small subset of these conditions, such that most conditions do not contribute 
relevant information but rather increase the level of background noise. 

In a recent paper we introduced a new method for the analysis of large-scale 
gene expression data that was designed to overcome the above-mentioned problems (see 
Refs. pT|,[2^ for other recent approaches). A central idea of this work was to integrate 
prior biological information, like the function or sequence of known genes, into the analysis 
of the gene expression data. In the present article we present a complementary method 
for the analysis of large-scale data that does not require any prior knowledge beyond the 
expression data. We start by providing a rigorous definition of the type of information 
we aim to extract from the expression data by introducing the notion of a transcription 
module (TM). A TM contains both a set of genes and a set of experimental conditions. 
The conditions of the TM induce a co-regulated expression of the genes belonging to this 
TM. That is, the expression profiles of the genes in the TM are the most similar to each 
other when compared over the conditions of the TM. Conversely, the patterns of gene 
expression obtained under the conditions of the TM are the most similar to each other 
when compared only over the genes of the TM. The degree of similarity is determined 
by a pair of threshold parameters. The gene threshold constrains the gene set, while 
the condition threshold constrains the condition set. Importantly, distinct transcription 
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modules may share common genes and conditions. 

The precise definition of a TM as the object to be retrieved from the expression data 
allows us to establish an efficient algorithm that searches for the modules encoded in the 
data. Starting from a set of randomly selected genes (or conditions) one iteratively refines 
the genes and conditions until they match the definition of a TM. Using a sufficiently 
large number of initial sets it is possible to determine all the modules corresponding to a 
particular pair of thresholds. Scanning through a range of thresholds decomposes the data 
into modules at different resolutions. 

This paper is organized as follows: In section ^ we provide a mathematical definition 
of a transcription module. In section ^ we introduce our algorithm that searches for such 
modules and compare our method with SVD. In section § we discuss the normalization of 
the expression data. In section ^ we present analytical insight into the role of the threshold 
in our algorithm. We show that for noisy expression data the application of a threshold 
improves significantly the identification of transcription modules. We provide an estimate 
for the maximal amount of noise for which a successful identification is still possible. In 
section |^ we compare our method with other standard tools using in-silico expression data. 
In section |^ we discuss briefiy results obtained by applying our algorithm to real expression 
data from the yeast S. cerevisiae. We conclude in section |[ 



2 Formalism 



2.1 The Expression Matrix 

We consider data from microarray experiments given in terms of a gene expression ma- 
trix E. The matrix element E'^^ denotes the log-fold expression- change of gene g E G = 
{1, Ng} at the experimental condition c G C = {1, Nq}, where Nq and Nq refer to 
the total number of genes and conditions, respectively. The matrix E may be viewed as a 
collection of A'",^ row vectors: 

f 9l\ 

T 

E= . (1) 

Each vector = {g^^\g^\ ...,g^^'^^) describes the gene-profile for condition c, containing 
the expression levels c/(») = E'3 of all the genes that were monitored under this condition. 
Alternatively the expression matrix can be viewed as a collection of Nq column vectors: 

E={ci,C2,...,CNg) ■ (2) 

Here each vector Cg = {c^^\c^p, c^J^^^)'^ describes the condition-profile for gene g, con- 
taining the expression levels = E'^B of this gene under all the conditions of the data 
set. 
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We define two normalized expression matrices (c.f. section 



( 9i \ 

92 



(3) 



and 



(4) 



The rows of Eq and tlie columns of Ec are given in terms of the normalized gene- and 
condition- vectors 



9c 



9 c {9c)geG 



\9c 



{9c) 



and 



^9 i^9)ceC 



geG\ 



(5) 



cgcI 



respectively. These vectors have zero mean ((^c)geG ~ (^g)cec ~ ^) ^^^^ length 
{\9c\ = \^g\ = This normalization implies that J2g = 0, J^giEQ ^ = 1 for each 
condition c and J^c^c = 0, I]c(-^c)^ — 1 each gene g. Centering and re-scaling 
the rows in Eq allows for a meaningful comparison between any two conditions c and c' 
through their associated gene-profiles and g^,. Similarly, centering and re-scahng the 
columns in Ec allows for the comparison of any two genes g and g' through their associated 
condition-profiles Cg and Cg/. Note that the normalized matrices Eg and Ec in general 
are not equal. 



2.2 Transcription Modules 

Our goal is to find sets of co-regulated genes Gm C G, together with the relevant experi- 
mental conditions Gm C G that induce their co-regulation. We refer to such a combined 

set, Mm ~ {Gm,} Gm}: '^^ 

transcription module (TM). Here the index m ranges between 
one and the number of transcription modules, Nm- Biologically a TM may be associated 
with a particular cellular function. Ideally each TM would correspond to a transcription 
factor that regulates the genes in Gm and that is activated under the conditions in Gm- 
Of course, a one-to-one correspondence between transcription modules and transcription 
factors is an over-simplification, but it can still provide useful insight into the nature of 
the expression data. First, the total number of transcription factors, Ntf, is much smaller 
than the number of genes: Ntf Nq- Thus we expect also the number of transcription 
modules, and therefore the effective dimensionality of the expression matrix to be rela- 
tively small: Nm -C Nc- Second, the number of genes activated by a single transcription 
factor, NcP\ is known to be limited: NcP^ <^ Nq- Third, different transcription factors 
can regulate the same gene and can be activated under the same experimental conditions. 
Hence distinct modules may share common genes and conditions. 
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Mathematically a TM can be defined as follows: 



3(Tc, Tg) 




{ceC: {E^S),,c,. > Tc} 
{geG: {E^c')cec^ > To} 



(6) 



where Tc and Tg are two threshold parameters. The above definition states that for each 
condition c in the TM the average expression level of the genes in the TM, (Eq) ^^q^, is 
above a certain threshold Tc- Conversely, for each gene g in the TM the average expression 
level over the conditions of the TM, (E^)^^^ , is also above some threshold Tg- This 
reciprocal dependence between the genes and the conditions associated with a TM implies 
that, considering only the genes of the module, the conditions of the module are exactly 
those for which the co-expression is the most stringent. Similarly, considering only the 
conditions of the module, the genes of the module are the most tightly co-regulated. Note 
that our definition of a TM is symmetric with respect to genes and conditions, such that 
no preference is given to either of them. In particular, we use the expression matrix Eg 
(normalized with respect to genes) in order to specify the conditions of the module (Cm), 
given the genes of the module (Gm)- Similarly we use Ec (normalized with respect to 
conditions) to specify the genes in 0^,, given the conditions in Cm- 

We would like to reformulate and somewhat generalize the definition of a TM in eq. (^ 
by introducing vector notation. To this end we represent the genes and the conditions of 
a TM by a pair of a gene-vector = {gmi9mi ■■■^9m^^)'^ a condition-vector = 
(c^\ c^-*, c^'^'')"^. A non-zero component g!^^ (Cm) implies that the gene g (condition 
c) is associated with the module m. Consider the linear transformations 



f 9l9m\ 



= Eg Or 



and 



9', 



^ C-i Cm ^ 
^2 C-m 



\9Nc9mJ 



(7) 



No' 



J 



The resulting vectors contain the projections of the vectors and Cm, that specify the 
TM, onto the set of the (normalized) gene-profiles {gd and condition-profiles {cg}, defined 
in eq. (||), that describe the expression data. For a binary vector g^ the components of 
c^"^ are just the expression levels summed over the genes of the TM for each condition in 
the data set. Likewise for a binary vector Cm the components of g^°^ are the expression 
levels summed over the conditions of the module for each gene. 
The consistency requirement in eq. (^ can then be written as 



3(tc, tG) 



'm I 



(8) 



, 9m = ftGi9 

where tc and tc are the condition- and the gene-threshold, related to Tc and Tc, respec 
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tively. The threshold function 



I w{x\) ■ 6(xi —t) \ 



(9) 



acts separately on each of the A^^; components Xj of the vector x and yields the prod- 
ucts of a weight-function w{x) and a step-function Q{x) as output. The arguments of 
the step-function, — fi{x))/a{x), have been centered and re-scaled. We use 

the mean as center, fi{x) = (x), and the expected or measured standard deviation. 



(t{x) = yYlf'^ixi — {x)y/Nx, as scale-factor. The step-function sets to zero all elements 
of the vector x that do not exceed ^{x) by at least t ■ (t{x). (Down- regulation can be 
captured by replacing Xi ^ \xi\ in eq. @.) Using w{x) = 1 as weight-function all the sig- 
nificant elements are set to unity. This binary formulation corresponds to the consistency 
requirement in eq. (^. (To capture down-regulation one uses sign(x) as weight-function.) 
It is straightforward to extend our formalism using different weight-functions. In this case 
the entries of the gene- and condition-vector become continuous, and their value deter- 
mines the significance of a particular gene or condition, respectively. As we shall see, a 
particularly relevant choice is w{x) = x in which case ft{x) is semi-linear. 

The compact definition of a TM in eq. (|^) can be understood as follows: Applying 
the threshold function /^^ to c^°-^ results in a non-zero component of the module's 
condition- vector c^, if the corresponding gene-profile is sufficiently aligned with the 
gene-vector of the module. Biologically this means that a significant fraction of the 
genes in the module are co-regulated under condition c. Similarly, the application of 
/ig to g^"^ results in a non-zero component g^^^ in the module's gene-vector g,^, if the 
corresponding condition-profile Cg is sufficiently aligned with the condition-vector of 
the module. Biologically this implies that a significant fraction of the conditions in the 
module induce a co-regulated expression of gene g. 

It is important to note that the content of a particular module Mm = {Gm, Cm} depends 
on the pair of thresholds (tctc)- In many cases for slightly larger thresholds there exists 
a related module M^, such that C Mm- Similarly, for somewhat smaller thresholds 
there usually exists a module M^""'", such that Mm C M^"^"-. Thus there are nested sets 
of modules, M^^ C ... C M^"""™ that persist over a finite range of the thresholds. This 
hierarchical structure resembles the tree structures obtained from clustering. However, in 
our case distinct branches may share common genes or conditions. 



3 The Iterative Signature Algorithm 

The rigorous definition of a transcription module in principle allows us to determine the 
modules encoded in the expression matrix by testing all possible sets {Gm, Cm} for their 
compliance with eq. (^. However, since the number of such sets scales exponentially with 
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the number of genes and conditions, such an approach is completely infeasible computa- 
tionally. We therefore suggest a different approach. Our principle idea is to search for 
solutions of the consistency equation in (H) through the map defined by 



c 



(n+l) _ f (JP^^(n) 



f,,{EG9n, (10) 

gin^l) ^ (11) 

The first equation assigns a condition- vector 0^"^+^^ to a given gene- vector g^'^\ We refer 
to the component c^""'"^^ of this vector as a condition score. This score is non-zero only if 
the corresponding gene-profile g^, defined in eq. (|^), is sufficiently aligned with the gene- 
vector In the subsequent step in eq. ( pH]) the component (or gene score) (7^"^^^ of the 
gene-vector g^~^^^ is assigned a non-zero value only if the corresponding condition-profile 
Cg is sufficiently aligned with the condition- vector c^"*"^-*. 



In a recent work we have applied the map in eqs. (|T0|) and (|ri|) to a variety of 
biologically motivated input-sets {gf^^} assembled according to prior knowledge of the 
regulatory sequence or function of the genes. Sets of co-regulated genes and co-regulating 
conditions were constructed from recurrent realizations of the output-sets defined by g^^^ 
and c'-^-' . In this work we pursue a different strategy, namely we apply the maps in eqs. ( p!OD 
and ( pUj) iteratively by re-using the gene-vector g^^^ as input for eqs. ([I0|) and ( pA]) in 
order to obtain new output-sets defined by c^"^^ and g^'^\ Repeating this procedure we 
obtain {g^^\ c^^^} from g^"^^ and so on. In general, the series {g^^\ g^^\ g^'^\ g^^\ ■■■} rapidly 
converges to a "fixed point" gene-vector g^*\ In general the series {g^^\ g^^\ g^'^\ g^^\ ■■■} 
rapidly converges and we can define a "fixed point" gene- vector gi*-" •* which satisfies 



\g{*) 

\g{*) + g,(n)| 



< e (12) 



for all n above a certain number of iterations. The parameter e determines the accuracy of 
the fixed point, g^*^ depends both on the "seed" g^^^ and the thresholds to and tc, which 
are fixed parameters. Together with the associated condition-vector c^*^ it defines a TM, 
since {g^*\c^*^) by definition solve eq. (|^). We call this procedure the Iterative Signature 
Algorithm (ISA). 

Although the set of possible input seeds is huge, usually there exist only a rather limited 
number of fixed points for a given set of thresholds [tctc)- Therefore, in general the ISA 
is applied as follows: (1) generate a (sufficiently large) sample of input seeds {g^^}, (2) find 
the fixed points {gl^\cl^) corresponding to each seed through iterations and (3) collect 
the distinct fixed points in order to decompose the expression data into modules. The 
structure of this decomposition depends on the choice of thresholds (to, tc)- Scanning over 
different values for (tctc) reveals the modular structure at different resolutions: Lower 
thresholds yield larger units whose co-regulation is relatively loose, while higher thresholds 
lead to smaller, tightly co- regulated modules. Each fixed point (fir^^, c^^) has its "basin of 



attraction", i.e. the set of seeds that converge to it under the iterative scheme in eqs. (10) 
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and (|Tl|) . The size of this set is a measure of the "convergence radius" , while the average 
number of iterations, that is needed until eq. (0) is satisfied, characterizes the "depth" of 
this basin. 

The computation time of any algorithm, designed for the analysis of large scale ex- 
pression data, is of crucial importance. For algorithms that require the full correlation 
matrices (like clustering or SVD), already the computation of these two matrices can be 
very intensive, since its computation time scales like f^Znp NqNq + N^Nq. However, 
the ISA is not based on this kind of information. Rather than squaring the expression 
matrix, only multiplications of the expression matrix with sparse matrices (of size Nq x Nj 
or Nq X Nj), where Ni is the number of input sets, have to be performed. Due to the 
sparseness, the computation time of the ISA goes like tcfmp °^ ^uerNi^NcNc + NqNc), 
where Nq and Nc refer to the average number of genes and condition, respectively, whose 
scores are above the threshold, and Nuer is the number of iterations until convergence. 
Thus the computation time of the ISA scales linearly with A^^ and Nq- In general only 
very few iterations Nuer are needed to find the fixed points. A large number of input sets 
Nj increases the chances to find the fixed points with a small convergence radius. However, 
for practical purposes it is useful to accumulate progressively sets a fixed points by running 
the ISA repeatedly with a moderate value for Nf, thus increasing gradually the accuracy 
of the fixed point decomposition. Importantly, Nq and Nc are much smaller than Nq and 
Nc as long as the respective thresholds are high enough. Finally, we note that t^^^p could 
be further improved by choosing the input seeds not completely at random, but using the 
information of previous runs (e.g. those at a different threshold). 

3.1 Comparison with Singular Value Decomposition 

For w{x) = X, in the absence of thresholds and neglecting the two different normalizations 
of the expression data, the iterative scheme reads 



^^"^ = ^^7^> (13) 



Q{n) ^ E C 

3 |^T^(n)| • 

The fixed points of the above equations correspond to the pairs of vectors (^„, c^), where 
9m = 9m/\9m\ ^-nd = Cm/\Cm\ are the normalized eigenvectors of E and EE'^, re- 
spectively. Both eigenvectors are associated with the common eigenvalue /i^ = \Eg^\^ = 
|-E"^CmP- It is interesting to note that a Singular Value Decomposition (SVD) of the ex- 
pression matrix yields exactly those eigenvectors and eigenvalues p4|,^ (see appendix |A.l 



for brief review of SVD). This decomposition is usually performed in a sequential manner. 
In this case one determines first the pair {qi, ci) associated with the largest eigenvalue fif. 
In fact this pair emerges as a fixed point of the above equations for any seed g^^^ that is 
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not perpendicular to g^. It can be shown that the matrix 

Ei = iiiCigl. (15) 

provides the best rank-1 approximation to E = Ei + Ri, where Ri denotes the residual 
term. A subsequent diagonalization of -Ri yields the (orthogonal) pair (^25^2) associated 
with the second largest eigenvalue fi2- Continuing this procedure eventually decomposes 
the expression matrix into a sum 

Nm 

E = Y.E^^ + Rn^ (16) 

m 

of the rank-1 matrices E^ = ^^mCmO^ with = |Cm||fl'm,l- These matrices can be viewed 
clS cl special kind of transcription modules. 

One of the advantages of SVD is that the significance of each modular component E^ 
can be determined simply according to the magnitude of the associated eigenvalue. The 
components associated with small eigenvalues are likely to reveal no real information and 
to contain only noise. Thus the spectrum of eigenvalues can give some indication of the 
dimensionality of the data: The existence of Nm eigenvalues that are significantly larger 
than the remaining eigenvalues suggests that there are Nm dominant components. Similar 
to SVD the lengths of the fixed point vectors of the ISA provide a measure of the relative 
importance of the associated TM. Specifically, Ifl'^-'P = Y^g^Gmid^g^Y reflects the size of the 
gene set and (for w{x) = x) the strength of its co-regulation, while jc^-'P = Z]cgC™(4*'*)^ 
reflects the size of the condition set and the strength of the co-regulation induced by this 
set. 

While the similarity between the ISA and SVD is instructive, there are several important 
differences: 

• Applying the threshold functions in eqs. ([T^) and ([TT]) yields a different spectrum 
of fixed points: Sets of genes that are fixed points of the iterative scheme for a 
particular choice of the threshold, in general do not correspond to the eigenvectors 
of the expression matrix. 

• The thresholds affect the stability of the fixed points: While the iterations in eqs. (|T^ 
and ( p!4| ) have only a single stable fixed point {g^, ci), the ISA in eqs. (p!OD and ( pT]) 
usually possesses several stable fixed points. This is essentially because the thresholds 
induce an "effective orthogonality" by setting the small scalar products in eq. (^ to 
zero. Consequently input sets that are almost (but not exactly) orthogonal to the 
strongest fixed point, do not flow towards this point under the iterations, but converge 
to a different fixed-point. 

• SVD is very sensitive to the (unavoidable) noise in the expression data. This noise 
induces mixing between modules that would be orthogonal to each other in the 
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absence of noise. In the ISA the threshold function provides an efficient way to deal 
with such noise. Excluding the bulk of the genes and conditions from the expression 
data at each step of the iterative procedure allows to pick up co-regulated units that 
would otherwise be masked by the noise. 

For SVD distinct eigenvectors g^^ and g^^, as well as Cm and c^' are orthogonal to each 
other, since they diagonalize a symmetric matrix. The constraint of orthogonality is 
not present in the ISA. 

SVD only reveals one single decomposition of the expression matrix into modules. 
As for the ISA, changing the values of the thresholds allows to analyze the modular 
structure recorded in the expression matrix at different resolutions. 

For SVD the expression data has to be normalized either according to genes or con- 
ditions. The choice of data normalization in general follows from the interpretation 
of the data. Demanding maximal variance among the principal components, one is 



led to center the data either as in Eq or Eq (see appendix |A.1| on SVD for details). 
Thus the symmetry between the genes and the conditions is explicitly broken when 
committing to either Ec or Eq- In contrast, the ISA avoids this bias by alternating 
between the two possible normalizations at each step of the iterative procedure in 
eqs. (IIOD and (M). 



We will discuss now some of these points in more detail. 



4 The proper data normalization 

Given the "raw" expression data contained it is difficult to compare two experiments [g^ 
and g^,) or two genes (c^ and c^/). This is because different experiments may affect the 
expression levels at a different scale. For example one condition may change the expression 
of many genes by a very large factor (s> 1) while another condition affects mainly the 
same genes, but shifts their expression level by a much smaller amount. Although the 
two conditions are related, this relation is not explicit in the expression data. Moreover, 
recording the expression levels with different microarray techniques as well as variations in 
the sample preparation can change the scale of the results. Similarly the dynamic range 
of two distinct genes could differ greatly even though the shape of their condition profiles 
might be similar. To overcome this difficulty we have introduced the normalized matrices 
Eg and Ec (c.f. eqs. (|) and (D). 

In order to study the impact of the normalization on our algorithm we generated an 
in-silico expression matrix E corresponding to two overlapping modules of equal size and 
strength (see section |^ for more details on the model used to generate these data). We 
selected random scale factors Sg, Sc G [0, 1] for each gene g and condition c from a uniform 
distribution and transformed the elements of the expression matrix according to E^^ 
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E^g = E'^^SgSc- Unlike the original expression matrix E, the re-scaled expression matrix Es 
(shown in Fig. |I]a) corresponds to the realistic scenario where the entities of the expression 
data have been recorded at different scales. From Es we calculated the normalized matrices 
Ec and Eq- 

The question we ask is which normalization has to be employed in order to reveal the 
"correct" genes from the conditions associated with the underlying module, and which 
normalization leads to the "correct" conditions, given the genes of the module. To answer 
this question we defined the vectors Qi and Ci by assigning non-zero components only for the 
genes and conditions of one of the modules, respectively. Using these vectors we computed 
cs = Egg I, cc = EcQi and Cq = EgQi as well as Qs = E'^Ci, = E'^Ci and g^ = 
EqCi. The components of the resulting gene- and condition- vectors are plotted in Fig. |l|b 
and c, respectively. 

One can see that only for g^^ and Cq (corresponding to the the "correct" normalizations 
as used in the ISA, c.f. eqs. ( |T0D and (|TTD) all the components associated with the genes 
and conditions of the module (specified by (gf]^,Ci)) are significantly larger than the oth- 
ers. For missing or "wrong" normalization there are large fluctuations among the vector 
components. Hence applying a threshold would only capture part of the relevant genes or 
conditions in this case. Thus Eq is best suited to identify the genes of a module from a set 
of conditions that is a good approximation of Cm, while Eq is the proper normalization 
to obtain the conditions of a module from a set of genes close to Gm- Note that using 
these "correct" normalizations, it is even possible to distinguish the genes and conditions 
associated exclusively with the specified module from those that belong also to the other 
module, because the latter obtain a somewhat lower score. 

5 Analysis of the ISA 

The fundamental issue is how well the ISA can reveal relatively small, noisy, and possibly 
overlapping modules from the expression matrix. In this section we address this ques- 
tion by considering a simple model where the expression matrix corresponds to a single 
transcription module. Our idea is to consider the gene-vector that undergoes iterations 
as a stochastic entity and to study how its distribution evolves under the iterations. This 
approach allows us to quantify how the efficiency of our algorithm depends on the size of 
the module and the noise in the expression data. 

5.1 Linear recursions 

In the following we consider a slightly simplified iterative scheme, where no threshold 
function is applied to the condition vector. In this case one can write an iterative equation 
that depends only on the gene vector. If, moreover, no gene threshold is applied the 
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iterations are defined through the hnear equation (c.f. eq. (48) in the Appendix) 

= -T^ fl7) 

Here the matrix C = E emerges from applying first eq. (|13]) and then eq. ([T^). As we 
mentioned before the fixed points of this linear recursion are the eigenvectors of C. 

Let us consider the simplest scenario corresponding to a single set of co-regulated genes 
Gi <Z G whose co-regulation is triggered by the conditions in Ci C G . Specifically, we 
assume that all the genes in Gi are equally important, such that a noise-free measurement 
would result in identical condition profiles for these genes. In this ideal case the matrix 
elements C^^' would equal some constant if both g and g' belong to Gi and be zero otherwise. 
In order to model the effect of noisy data we consider the elements of C as random variables 
with mean value 

g,g'eGi 
otherwise ' 

and variance Vc for all g, g' G G. In the absence of noise (i.e. Vc = 0) the matrix C possesses 
only a single (non-trivial) eigenvector g^^\ whose non-zero components specify the genes 
of the TM. However, for Vc > this is not true anymore. 

Assume we knew the eigenvector of C for Vc = and use it as a (binary) seed g^^^ for 
eq. {^n\ ) with a noisy realization of C (i.e. Vc > 0). The question is whether the fixed- 
point resulting from gf^°) still characterizes the genes of the module. In general the vector 
g^^^ obtained by the first iteration does not coincide with g^'^\ Due to the probabilistic 
description of C we can only determine the mean and the variance of the components of 
gr(^) = Cg^^\ The mean of g^g^ = J2g' C^^' 9^^^ is equal to the number of genes in the module, 
N(P\ times ^c 9 ^ Gi, and zero otherwise. Similarly the variance of is iVg"Vc. Here 
we only used the additivity of the mean and the variance. However, already for g^^^ in the 
next iteration we need to deal with products of random variables. To this end we note that 
for two independent random variables a and b we have (see appendix [A.2| for proof) 

(ab) = (a) (b) and V{ab) = V{a) V{b) + V{a) {bf + V{b) {af . (19) 

Using these results we find that the mean values of the components of the vector gr(") = 
are given by 

)-\ 9^Gi' ^ 

where /Iq denotes the mean of the components g^^~^^ associated with the module 
{g e Gi). Only for the genes in Gi there are Nq^'' matrix elements in C that contribute 
constructively to ((7^"^). Similarly, the variances of g^^^ are 

V!,""^ ^ ANaVcVt'^ + {VcVt'^ + Vcif.t'^y + Vt'^^^l) 9 e G, 
V^,''^^/\NaVcVt'^+N^o^Vc{vt'^ + {^^^S~'^f) g ^ G, ' 

(21) 



V{g, 



12 



where ANq = Nq — Nq^^ denotes the number of genes that do not belong to the module. 
Note that V^"'' has an additional term with respect to VcP\ due to the contribution of the 
non-zero mean values in C. 

In order to assess whether the iterations improve the separability between distributions 
of the genes within {g G Gi) and outside {g ^ Gi) the module, we introduce the re-scaled 
variances 



KPg) Wg 



2 



Note that Vq and Vq are dimensionless and invariant under the normalization of the gene- 
vectors. Vq <^ 1 implies that the distribution of the genes associated with the module is 
well separated from the distribution of the genes that do not belong to the module. Using 
eqs. (^) and (pT]) we obtain the following recursive equations 



(n-1) 



G 

where f c = Vc//ic is the (fixed) noise-to-signal ratio of the expression matrix. 

If A*"^'"'' ^ 1 the second term in eq. (plj) is negligible and we can ignore the small differ- 
ence between f^-* and v'q\ Then, setting v'q'^ = Vq^ in eq. ( ^3]) leads to the approximate 
recursive equation 

^(n) ^ NgVc ^,(n-l) ^25) 



This equation converges to 



^ ' vc N^J"^ 



(26) 



provided that 



vc < - . (27) 

For further reference we state this result also for the signal-to-noise ratio 

(n) 

>)^_^ = (J"))-i/2. (28) 
The corresponding fixed-point value equals to 

1/2 



{*) 

Pg 



{pI - iprr)] ' , (29) 
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if 



pc-^>pr'-^. (30) 



and is zero otherwise. 

The interpretation of the critical value Vq"^ for the noise in the expression data is 
straightforward: Only sets of genes that are sufficiently large and whose co-regulation 
is recorded in the expression matrix with relatively low noise (i.e. vc < v'q^*) can be 
captured by the iterative procedure without threshold in eq. (0). Actually eq. (0) is only 
a necessary condition for the identification of a module, since for a reliable separation of 
the distributions of the gene-scores associated with the module, we need p^Q 3> 0. As 
we mentioned before, the number of genes associated with cellular functions is expected 
to be rather limited, Nq^'^ -C Nq- Therefore we conclude that eq. (|30| ) presents a serious 
limitation for the extraction of biologically relevant modules through the analysis of the 
eigenvectors of C (as in SVD). 



5.2 Noise reduction by the threshold function 

As discussed in the previous section the noise in the expression data may obstruct the 
identification of a TM. A fundamental aspect of the threshold functions in the ISA is to 
reduce the effect of such noise by excluding the bulk of the genes and conditions that do 
not contribute information but rather increase the level of background noise. 

To illustrate this point, let us repeat the study of noise propagation presented above 
for the simplified iterative scheme like in eq. (|T7|) , but with the linear map followed by a 
threshold function: 

9^") = /*(C^("-^)), (31) 

where ft is defined in eq. (|^) and we use a linear weight-function w{x) = x. Let us assume 
that the gene scores are distributed according to normal distributions J\f{x; fi, a), where fi 
and a refer to the mean and the standard deviation of the random variable x. As a result 
of the threshold function only 

Nt^=Nt^ fAf{p-pt'\l)dp (32) 

genes from the module contribute constructively to the mean in eq. (pO]). Similarly, only 
Nq^"^ genes from the module and 

ANg = ANg Ar{p]0,l)dp (33) 

genes outside the module contribute to the variance of g^^^ in eq. (^). Nq^'^ is the expected 
number of genes in the module, whose score has not been set to zero by the threshold 
function. Similarly, ANq is the expected number of genes that do not belong to the 
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module, but have a non-zero score. The crucial point is that, because of the different mean 
values of the two distributions, the threshold function excludes more genes that do not 
belong to the module than genes that do belong to the module. For example, if Pq = 3 
for the initial (normal) distribution, then a threshold t = 2 would remove almost 98% 
of the genes outside the module {ANq — 0.023 x ANq), but less than 16% of the genes 
associated with the module (iV^™^ ~ 0.841 x Nq^^). We note that the precise shape of the 
distribution function is in fact not crucial, since our derivation relies only on the additivity 
of the mean values and variances, and eq. (|T^. 

It follows that the mean values and variances of the components of the vector gr(") are 
given by the same expression as in eqs. (|20|) and (|2l|) , respectively, except that we have 
to replace Nq^^ by iY^"^^ and ANq by ANq- Substituting the effective numbers iVg"^ and 
ANg into eqs. (|2y) and (^1]) the argument leading to the expression for the fixed-point 
signal-to-noise ratio in eq. (^) is essentially unchanged, and we have 



(*) 
Pg 



Nt^ {pI - rPcy)r , (34) 



with 



PC - J^) • (35) 

Note that unlike for eq. (p9D , the right-hand side of eq. (0) still depends on through 
Nq^\ Therefore eq. ( P^ is an integral equation for p^^ which can be solved numerically. 
A graphical solution of this equation is provided in Fig. El for different thresholds and a 
specific choice of the parameters Ng, Nq and vq (see caption for details). 

As can be seen in Fig. ^ applying a threshold function improves significantly the 
identification of the module. We show the fixed point value of the signal-to- noise ratio, p'q\ 
as a function of both the threshold t and the (fixed) signal-to-noise ratio pc of the expression 
data. In the absence of a threshold function p[^^ converges to zero if pc is below some critical 
value p^^^. Applying a threshold, p^Q^ converges to a finite value, even if pc < p^^^ (but 
Pc > P'c^^)) indicating the identification of the module. Moreover, one can see from Fig. |^a 
that there is an optimal regime for the threshold t, where pQ\t,pc) is (nearly) maximal. 
Within this regime pQ\t,pc) depends only weakly on t, so the convergence is robust with 
respect to the exact choice of the threshold. The size of this regime increases with pc- 

In order to quantify the relative increase of the fixed point value of the signal-to-noise 
ratio pg^(t, Pc) due to the application of the threshold function we define the ratio 

Kt,Pc)^ "°""-y,'-"°"'''^' . (36) 

PGit^Pc) 

where Pg\pc) refers to the value to which the signal-to-noise ratio converges when no 
threshold is applied. For pg^(t, pc) = we set r{t,pc) to zero. We show r{t,pc) as a 
function of t and pc in Fig. ^d. The figure shows that there exists a large region in the 
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parameter space of t and pc < p^^**, where the iterations only converge to a positive value 
due to the threshold. Moreover, even for pc > Pcf^*, where the iterative schemes converges 
to a positive value also without a threshold, there exists a large region, where p^ci^iPc) 
is significantly larger than pQ\t). Thus we conclude that the threshold function improves 
significantly (and in certain cases makes at all possible) the convergence of a noisy input 
set to a gene-vector that specifies the TM. 

We have also performed numerical simulations of the iterative scheme in eq. (|3TD . To 
this end we employed in-silico expression data that were generated according to eq. ([THp 
and superimposed with a certain level of noise. The initial gene sets were composed such 
that only the distribution of the genes scores associated with the module had a non-zero 
mean value, while the distribution of the remaining genes was centered around zero. The 
simulation allowed us to trace the evolution of the two distributions under the iterations. 
The results indicate a good agreement between the numerical and the analytical results. 
Details of this analysis are presented in Fig. ^. In particular, in Fig. ^ we show an example 
where only the application of a proper threshold leads to a separation between the two 
distributions. 

6 Beyond the single module 

In order to study the ISA in a more realistic scenario, we have performed further numerical 
simulations based on in-silico expression data encoding several, possibly overlapping tran- 
scription modules. These data were generated according to the following simple model: 
Each module is governed by a single (virtual) transcription factor whose activity is 
described by a pair of vectors {9^, Cm}- The non-zero components g^^^ of the gene-vector 
specify the genes that are transcribed if the transcription factor m is active, while the 
non-zero components c^^ of the condition-vector Cm specify the conditions that activate 
this transcription factor. Then for Nm modules the log expression of gene g at condition c 
is defined as E'^^ = J2m=i 9m ^^m- The final expression matrix is obtained by adding noise 
to these matrix elements. 

6.1 Expression data corresponding to two modules 

As initial example we consider in-silico expression data based on two transcription factors. 
We defined the components c^ and g!^^ for m = 1, 2 such that there are two overlapping 
transcription modules Mi and M2 (see Fig. ^for details). We applied the ISA to a collection 
of input sets composed of randomly chosen genes. We found that the structure of the 
resulting fixed points depends strongly on the threshold to- Fig. shows the corresponding 
output sets for a discrete choice thresholds: For a very low threshold {t ~ —2) the output 
sets contain essentially all the genes. Applying a somewhat higher threshold {t ~ —1) yields 
output sets containing all the genes that are associated with either of the two modules. 
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For a moderate threshold {t ~ 0) there are two types of output sets, comprising either the 
genes of Mi or M2. For a high threshold (t ~ 1) all the output set contain only those 
genes that belong to both modules. Finally, for a very high threshold {t ~ 2) the output 
sets are empty. For intermediate values of the threshold value one observes relatively 
sharp transitions between these well-defined fixed points (Fig. ^). At these transitions 
the correspondence between the output sets and the modular structure of the data is less 
precise. 

We have also varied the condition threshold tc- Interestingly, for not too large a 
threshold {tc ^ 2) the resulting gene output sets are almost independent of the choice of 
tc- However, the condition output sets depend critically on the value of tc and exhibit 
a similar behavior as the gene output sets in terms of structure (not shown). This is 
not surprising, since the ISA is symmetric with respect to genes and conditions. We 
conclude that scanning over different values of tc and tc reveals the modular structure 
of the expression data, starting from the "supermodule" MilJMi, over its overlapping 
components Mi and M2, to the "submodule" Mif]Mi. 

6.2 Expression data corresponding to many modules 

The above example shows that the ISA can identify overlapping modules. However, for 
Nm = 2 there exist only 2^ = 4 possible transcriptional states, so the 100 conditions of the 
expression data are highly redundant. For real data the situation is reverse: The number 
of experimental conditions is much smaller than the possible number of transcriptional 
states. In order to study how the ISA deals with such a scenario we considered a set 
of more realistic models based on many transcription modules. We investigated to what 
extend the ISA, as well as hierarchical clustering and SVD, were able to reconstruct these 
modules from the respective in-silico expression data. 

In the first numerical experiment we studied how the different algorithms handle noisy 
data. To this end we generated expression matrices corresponding to 1050 genes and 
1000 experimental conditions that belong to 25 modules of different sizes, each associated 
with a transcription factor. In order to focus on the effect of noise we considered only non- 
overlapping modules that do not share any genes or conditions. Onto the binary expression 
data we superimposed noise from a random distribution. We varied the width a of this 
distribution, simulating different levels of noise. 

In order to quantify how well the modules were identified by the different methods we 
proceeded as follows: For SVD we collected the 25 eigenvectors of the gene-gene correlation 
matrix that were associated with the largest eigenvalues. For each of the 25 modules we 
selected the eigenvector that had the largest overlap with the gene-vector characterizing 
the module, and in Fig. ^ we show the average Pearson coefficient between these two 
vectors (triangles). For hierarchical clustering we used the matlab implementation for 
average linkage to compute the complete hierarchical cluster tree. Using this cluster tree we 
partitioned the expression matrix using different cutoffs such that the resultant partitions 
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contained at least 15 and at most 40 clusters. From all these partitions we selected the one 
whose clusters had the highest average overlap with the gene content of the modules. This 
overlap is shown in Fig. ^ (squares). Finally, for the ISA we re-constructed the modules 
from the fixed points that occurred repeatedly. Namely, in order to avoid artifacts due to 
distinct, but very similar fixed points, we "fused" these solution using a procedure that 



resembles agglomerative clustering, albeit for modules rather than genes (see Ref. [|23 



for details). The fraction of correctly identified genes per module (circles) as well as the 
fraction of correctly identified modules (asterisks) is shown in Fig. ^ We conclude that for 
noisy data the identification capability of the ISA is superior to that of SVD and clustering. 
In particular, SVD is very sensitive to the addition of noise and fails to identify the modules 
accurately, even for a small level of noise. Clustering can handle a moderate amount of 
noise, but not as much as the ISA. 

A second numerical experiment was designed to study quantitatively the ability to 
identify overlapping modules. We specify the regulatory complexity by the the number 
of transcription factors per gene htf- Only if each gene (and condition) is associated 
with exactly one transcription factor {nxF = 1) the expression matrix can be written 
in block- diagonal form. For larger values of htf distinct modules share common genes 
and conditions and the expression matrix cannot be reorganized into in block-diagonal 
shape. We applied the SVD, hierarchical clustering and the ISA to the expression matrices 
generated for utf = 1, •••56 and evaluated the outputs in the same manner as described 
above (see Ref. for related results). The results are shown in Fig. One can see that 
the ISA could successfully identify all the transcription modules even in the case of highly 
overlapping modules. In contrast, for tt-tf > 1 the identification capabilities of SVD and 
clustering rapidly decrease. This is because the clustering algorithm does not allow for 
multiple assignments of one gene to different modules and therefore usually captures only 
small, incomplete fractions of the overlapping modules. Similarly, if the expression matrix 
cannot be reorganized into block- diagonal shape due to the overlap between the modules, 
the eigenvectors identified by SVD fail to characterize the modules properly. 



7 Applying the ISA to yeast expression data 

The analytical and numerical studies presented above indicate that the ISA is well-suited for 
the analysis of expression data. In this section we give a brief presentation of the biological 
insight that can be obtained from applying our method to real data. We analyzed a diverse 
set of more than 1000 DNA-chip experiments that were obtained by different groups 0. 
The yeast S. cerevisiae is an ideal model organism to test our algorithm, due to the wealth 
of expression data and the large amount additional biological knowledge that exists for 
this organism. 

We have applied the ISA to the yeast expression data using different values for the 
gene-threshold to = 1.8, 1.9, 4.0, while the condition-threshold was fixed to tc = 2.0. 
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(As we pointed out previously the gene-content of the modules depends only weakly on the 
exact choice for tc-) For each value of to we employed ~ 20, 000 randomly composed initial 
gene sets of various sizes in the search for fixed points. The modules were reconstructed 
from the recurrent fixed points using a similar algorithm as for the in-silico expression 
data. Indeed such a processing of the "raw" fixed points is needed to avoid many similar 
modules that biologically correspond to the same co-regulated unit. 

The number of modules increases with to, ranging between five at the lowest level 
(to = 1.8) to ~ 100 at the highest resolution (to = 4). In contrast, the typical module size 
declines rapidly as a function of to- The step- wise increasing of to exposed many chains 
of closely related modules that persist for finite ranges tc e Increasing to, 

the number of genes assigned to each element of the chain decreases until the size of the 
module declines sharply at to = t^^ and either disappears completely or splits into two or 
more sub-modules. Likewise decreasing to beyond tQ^^°"^ destabilizes the fixed point, since 
many unrelated genes are added to the module that pull the module towards a different 
fixed point. In this case the module may either 'merge' with another module or flow into 
a completely different fixed point. 

The five stable fixed points identified for tc = 1.8 correspond to the central functions 
of the yeast organism: protein synthesis, cell-cycle (Gl), mating, amino-acid biosynthesis 
and stress response. Each module contains between 100 and 300 genes. Protein synthesis 
and stress are the most dominant modules and comprise most of the experimental con- 
ditions of the data set. In fact, these modules remain fixed points throughout the entire 
range of thresholds considered here, and therefore can be considered the backbone of the 
transcriptional network. 

A visualization of this network is presented in Fig. |^a. For each threshold the corre- 
sponding modules are displayed in a plane, such that their distance reflects their correlation 
with respect to conditions. Moving to a higher threshold, nested sets of modules are kept 
in the same position in each plane, while the "new" modules are placed such that their 
position reflects best their correlation with the other modules. This organization of the 
chains of nested modules is somewhat similar to the data presentation by hierarchical trees 
commonly produced by cluster algorithms. However, in our case, chains of modules may 
extend over a finite range of to and distinct chains can contain common genes. Additional 
information, such as the number of input seeds that converged to the same fixed pointed 
(shown as pie charts in Fig. ^d), provide further inside into the transcriptional network. 

In a previous analysis of the same data we apphed the map in eqs. (|T^) and ([IT]) to a 
variety of biologically motivated input-sets {gf^} assembled according to prior knowledge 
of the regulatory sequence or function of the genes, and reconstructed the modules from 
recurrent realizations of the output-sets defined by g^^^ and c^^\ Remarkably, the ISA 
(which requires no information beyond the expression data whatsoever) revealed essentially 
all the co-regulated units that we found in this analysis, as well as several new transcription 
modules that had not been identified previously. Moreover, the ISA provides additional 
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insight into the modular organization through the evolution of the modules over different 
threshold values. Studying the functional annotations of the genes assigned to the modules, 
we observed a strong coherence for the genes that have been annotated in most of these 
modules. This suggests that the ISA provides a biologically meaningful decomposition 
into co-regulated units. A comprehensive discussion of the biological implications of this 
analysis is beyond the scope of this work and will be pursued elsewhere EE . 



8 Conclusions 

We have presented a novel method for the analysis of gene expression data. The innovation 
of our approach is twofold: On the conceptual level we provide a rigorous definition of what 
we want to extract from the expression data by introducing the notion of a transcription 
module (TM). Our definition in eq. (H) assigns to a TM both a set of co-regulated genes 
and the set of experimental conditions under which this co-regulation is the most stringent. 
The size of a TM depends critically on the associated set of two thresholds that determine 
the similarity between the genes and conditions of the module, respectively. The genes 
and conditions of a TM are mutually consistent implying that the latter can be obtained 
from the former and vice versa. The notion of a TM is well motivated biologically. Ideally 
the genes and conditions can be associated with a transcription factor or a (fraction of) a 
pathway. Importantly distinct modules may share both common genes and conditions. 

On the computational level our definition of a TM provides the basis for simple, but ef- 
ficient algorithm to obtain the modules encoded in the expression data. Starting from a set 
of randomly selected genes (or conditions) one refines iteratively the genes and conditions 
until they are mutually consistent and match the definition of a TM. The important point 
is that at each step of the iterations we apply a threshold function, thus maintaining only 
significantly co-regulated genes and the associated co-regulating conditions. The threshold 
stabilizes compact sets of co-regulated genes and prevents the introduction of noise from 
unrelated genes and conditions. Using a sufficiently large number of initial random sets 
it is possible to determine all the fixed points of the iterative scheme for a given pair of 
thresholds. Scanning through a range of values for these thresholds decomposes the data 
into modules at different resolutions. Since the computation time for each iteration of our 
algorithm scales only linearly with the total number of genes it is particularly well-suited 
for the analysis of large scale expression data. 

Considering a simplified scenario of a single transcription module embedded in a noisy 
background of unrelated genes, we showed analytically that the application of a threshold 
improves the convergence properties of the iterative scheme. Specifically, we considered 
the gene-vector that undergoes iterations as a stochastic entity and studied the evolution 
of its distribution under the iterations for a given threshold. This allowed us to quantify 
how the successful identification of the module depends on the size of the module and the 
noise in the expression data. 
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Our analytical insights were confirmed numerically using computer-generated expres- 
sion data. More complex gene regulation were also simulated in-silico. Considering a 
model with two overlapping transcription modules, we showed that applying the ISA using 
a range of threshold values reveals the structure of the expression data at different resolu- 
tions. Depending on the value of the threshold our algorithm can reveal each of the two 
modules, as well as their union and intersection. Using large computer-generated expres- 
sion matrices we studied the capability of the ISA to reveal a large number of overlapping 
transcription modules from noisy expression data. We find that our method is significantly 
more efficient at this task than standard tools, like SVD and clustering. 

The threshold functions as a resolution parameter in our analysis of real expression data. 
Using genome-wide expression data gathered in more than 1000 experimental conditions, 
we decomposed the yeast genome into sets of transcription modules at different resolutions. 
The modular decomposition reveals a hierarchical structure of the regulatory network. 
At the lowest resolution we identified five transcription modules that correspond to the 
central functions of the yeast organism. Increasing the threshold the number of modules 
increases while their size decreases. The functional coherence of these modules indicates 
both the reliability of our approach and the strong correlation between co-function and co- 
regulation at the transcriptional level in yeast. A comprehensive discussion of the biological 
implications of this analysis will be presented elsewhere . 



Finally we note that our formalism can be applied to analyze any data set that consists 
of multi-component measurements. While we presented our method in the context of 
gene-expression data, it is clear that our approach is well-suited to reveal the modular 
organization encoded in any data matrix. Applications of the ISA could include the analysis 
of biological data on protein-protein interactions or cell growth assays, as well as other large 
scale data, where a meaningful reduction of complexity is needed. 
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A Appendix 

A.l Singular Value Decomposition 

This appendix reviews Singular Value Decomposition (SVD), which is a common tool 
for the analysis of expression data. We use notations that make the similarities with 
the Iterative Signature Algorithm (ISA) the most apparent. SVD is used to reduce the 
dimensionality of the data by projecting it onto a subspace in such a way that as little 
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information is lost as possible. To this end consider the following matrix: 

Em = CmgJn, (37) 

whose elements = g'^ c^^ are simply the products of the components of a given gene- 
vector and condition- vectors Cm- For two binary vectors g^^ and Cm (whose elements 
are either zero or one) E'^ is unity if the module m contains the gene g and the condition c 
(i.e. the relevant vector components are 5^^^ = 1 and c^^ = 1). For real vectors g^ G IR^° 
and Cm G IR^'^ it is useful to rewrite the matrix in eq. (^^ as 

Em — f^m^rnSIm i (38) 

in terms of the normalized vectors gim = gm/\gm\ Cm = Cm/\cm\- This normalization 
removes the ambiguity in the choice of g^, and Cm due to the invariance of Em under the 
transformation g^ — ^ 0fl'm ^^"^ c.m/(t>i where 7^ is an arbitrary real number. 

The prefactor Hm = \gm\ Wm\ is just the product of the lengths of gm and c^. Then 
each module is associated with a triple (/im, Slm^ ^m) of a real number and two normalized 
vectors. Comparing the magnitude of any two matrix elements E^ and E^^' reveals the 
relative importance between the gene-condition pairs [g, c) and {g', d) for module m. 
Multiplying E^. with an arbitrary gene- vector g gives 

Emg = aCm with a = fimSfrnQ, (39) 

while multiplication of E^ = IJ^m Sim with any condition- vector c gives 

EmC = Pgm with p = fimC^C. (40) 

Thus Em and EJ^ are projection operators onto the one-dimensional spaces spanned by 
gm and Cm, respectively. Consequently theses matrices have rank 1. 

Now the basic idea of SVD is to reduce the complexity of the data by expressing E in 
terms of a relatively small number A^m('C Ng, Nc) of such rank 1 matrices: 

Nm 

E = J2Em + Rnm • (41) 



Here R denotes the residual term whose euklidean norm \R\ = yJ2g,c{R'^^y has to be 
minimized in order to optimize the decomposition into modules in the above equation. 
It is instructive to consider first the minimization for the case Nm = 1- We have 

\R\' = Y.iE'''-E^J(r = Y.iE'''-f^mC^^^&' (42) 

9,c g,c 

= E(^^')'-2^"^^'=^a^^^4^^ + -"™(^^^^4^^)'- (43) 

9,c 
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Setting the derivative of |-Rp with respect to the component c^^, 



d\R\ 

4c) 

Cm 



Y: -2fimE^'g^£ + 2fil{gi^^fct^ , (44) 



to zero we find that that iJLmC^^ = Hg 9m I ^gidmY recalhng the normahzation of 
and switching to vector notation: 

^imCm = Eg^. (45) 
Similarly equating d\R\'^ /g''^ to zero it follows that 

lJLm9m = E'^Cm ■ (46) 

This remarkable result implies that E^. can be determined simply by solving simultaneously 
the linear equations in eqs. ( ^5] ) and (|iB|). The latter is equivalent to a singular value 
decomposition (SVD) of the matrix E: 

G^EC = M , (47) 

where G = {g^, g2, ■■■,gr) and C = (ci, £2, c^.) are orthogonal matrices. M is a diagonal 
matrix of the same dimensions as E whose non-zero elements are given by /im and ordered 
such that > — ■■■ — A^r- — niin(A'"G, A^^c) is the rank of the expression matrix E. 



Combining eqs. (^5]) and (E3) one finds 



E^Eg^ = fi^g^, (48) 
EE'^Cm = ^^Cm , (49) 

implying that G is composed of the eigenvectors g^ of E^ E and C consist of the eigen- 
vectors Cm of EE^ . One way to solve the above equations is start with some initial 
gene-vector g^^\ obtain the corresponding condition-vector via c*-^-* = Eg^^'^ /\Eg^^^\ ac- 
cording to eq. d^), and use the result to compute g*-^-* = E'^ c^^'^ /\E'^ c!'^^\ using eq. (^). 
Iterating this alternating procedure as in eqs. ( [I3| ) and (|T^) converges to the pair (g^, ci) 
associated with largest eigenvalue /if = \Eg-^\^ provided that the initial vector g'^^^ was not 
orthogonal to gi- Thus the predominant module emerges as the "fixed point" of the above 
coupled equations. 

From eq. (H) it follows that \R\^ = Eg,c(^'^)^ - l^m- Hence for Nm = 1 the norm of 



the residual term, \R\^, is minimized exactly by the triple (/xi, §1, Ci). It is straightforward 
to extend this approach to the expansion of the expression matrix in terms of several 
modules as in eq. (|41|). To this end one first computes Ei = ^iCig\ as described above 
and applies the same scheme to the residual term Ri = E — Ei. This yields E2 = /i2C2^2 
associated with the second largest eigenvalue fi2- Repeating this procedure sequentially 
yields eventually the complete SVD of the matrix E. However, for practical purposes 
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it is usually sufficient to compute only a limited numbers of triples {nm^g^^Cm) with 
m = 1, Nm until the norm of the residual term |i?7VMp = J2g,c{E'^^)'^ ~Y^m=i A^m below 
a certain threshold. Thus, approximating the expression matrix in terms of a relatively 
small number of modules, Nm -C r reduces the complexity of the data. 

There are two interpretations for the expansion in eq. ( PT| ) that depend on the way the 
expression data is viewed. If we consider the data as a collection of gene-vectors as in 
eq. (|l|) , then eq. (|4l| ) translates into an expansion of these vectors in terms of a collection 
of gene-vectors, i.e. 

Nm 

9c= I^rncl^^a^m+gf (c = 1, Nc) , (50) 

m=l 

where {gim} is the basis (one for all g^), and the expansion coefficients are given by jJ-mCm 
(one for each g^). Moreover, for each g^ there is a residual gene- vector g^, that determines 
how well g^ is approximated by the sum. Conversely, if we consider the data as a collection 
of condition- vectors Cg as in eq. (|^), then the expansion in eq. ( ^T]) can be read as 

= E f^mgi^^^m + Cf (^ = 1, Ng) , (51) 
m=l 

where denotes the residual condition- vector. In this case the condition-vectors of the 
modules, provide the basis of expansion, while the expansion coefficients for each Cg 

are given by /im^^^- 

So far we have left the normalization of E unspecified. In fact the choice of normaliza- 
tion follows from the interpretation of the data, if, instead of a minimal residual term in 
eq. (^2]), one demands maximal variance among the principal components (the projections 
of the data rows or columns onto the eigenvectors associated with the largest eigenvalues). 
For example, if the expression data is viewed as a collection of gene-vectors, one would like 
to find the vector g^ that maximizes the variance of the principal components Ci''' = g^Sli, 
i.e. 

1 A^C 9 1 

Vf = ^^n^? - i^?):) (52) 
Here the bilinear term has been written in terms of the scatter matrix 

Nc 

S,^T.i9c-{9c)J i9c-{9c)f ■ (53) 

c=l 

Maximizing Vi under the constraint that gjgi = 1 is equivalent to finding the eigenvector 
of Sg associated with the largest eigenvalue. For normalized data, Sg coincides with the 
gene-gene correlation matrix 

Cg = ElEc with Cf = clcg,. (54) 
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Conversely, if the expression data is viewed as a collection of condition-vectors, the vector 
Ci that maximizes the variance of the components g^i"* = cjci, is the eigenvector associated 
with the largest eigenvalue of the scatter matrix 

Sc=Y. (^9 - {Cc,)g) {Cg - {Cg)g) . (55) 
9=1 

For normalized data, Sc equals to the condition-condition correlation matrix 

C, = EgEI with Cf=g^g,,. (56) 

Note, however, that since Eg 7^ Ec, the matrices Ec E'^ and EqEq are different from 
Cc and Cg, and do not represent correlation matrices. 



A. 2 The variance of a product of random variables 

By definition the mean of the product of two independent random variables a and b is the 
product of their mean values, i.e. 

{ab) = {a){b). (57) 

Since the expression for the variance of the product ab in eq. (|TPp may be somewhat less 
obvious, we give its derivation here. From the definition of the variance 

V{a)^{{a-{a)r) = {a')-{a)\ (58) 

we obtain 

V{a)V{b) = {{a') - {af) {{b') - {bf) (59) 

= {a'){b')-{af{b')-{a'){bf + {af{b)\ (60) 

Then using eqs. (p7D-(|60D it follows that 

V{ab) = {a^b^)-{abf (61) 

= {a'){b')~{af{bf (62) 

= Via)Vib) + {af{b'') + {a^){bf-2{af{bf (63) 

= V{a)V{b) + {{a') - (af) {bf + {{b') - (bf) {af (64) 

= V{a)V{b) + V{a){bf + V{b){a)\ (65) 



A. 3 Accurate treatment of the noise propagation 

In order to simplify our presentation of the propagation of the noise under the iterative 
scheme in eq. (p!7|) we used the approximate recursive equation in eq. (|^) to derive the 
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fixed point noise-to-signal ratio in eq. (^61) . Here we give an accurate treatment that is 
valid even if Nq^'^ ^ 1 is not satisfied. 



First, note that if the iterative scheme converges, then for n ^ oo we have f. 



(n-l) 



,(*) 



Vq and Vq 



~(n-l) 



-{*) 
Ve- 



in) 
G 



In this case we can write two fixed-point equations 



-(*) 

Vg 



(*) 



ANgVc \ 


_ 






_ 1 ] 


-{*) 
= ■Vg ■ 



,(*) 



+ 1), 



(66) 
(67) 



Solving eqs. ( |66D and (^) for Vq^ we get: 
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Here, the approximation on the right-hand-side neglects the term and yields exactly 

the same result as obtained from the simplified iterative scheme in eq. (p5|) that ignores 
the difference between Vq^ and Vq^ 

Interestingly, a necessary condition for convergence can be derived also without any 

I) and (|2^). To this end note that eq. (|2^ implies 



approximation directly from eqs. 

,(") ^ ~(") 



trivially that Vq >Vq . Then it follows that 



(n) , NgVc + A/'cT^ (n-l) , Vq 

^'G < ^-tt^^tt:— ^^G ' + 



N, 



(m) 



(69) 



Thus if 



vc < vT' 



G 



(70) 



the noise-to-signal ratio f^"* converges to a finite value. 
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Figure 1: How to properly normalize the expression matrix, (a) An in-silico expression 
matrix, corresponding to two overlapping modules of equal size and strength, was generated 
according to the model described in the text. The elements of the original expression 
matrix were scaled to E'g = E'^^SgSc, where Sg G [0, 1] and Sc G [0, 1] are random 
scale factors selected from a uniform distribution for each gene g and condition c. From 
Es we calculated the normalized expression matrices Eq and Ec according to eqs. 
and (^. (b) From the vector Ci, whose non-zero components cf^ specify the conditions 
of the upper- left module in (a) we calculated the vectors Qs = E^Ci, g(j = E'^Ci and 
Qq = EqCi. We plot their components (horizontal axes) g^g^ (black), g^^ (dark gray) and 
(light gray) as a function of the gene index (vertical axis). Only for g^, obtained 
according to normalization used in the ISA, all the components associated with the genes 
of the module are significantly larger than the others, (c) From the vector Qi, whose non- 
zero components gi^^ specify the genes of the upper-left module in (a) we calculated the 
vectors cg = EsQi, cq = EcQi and Qq = EqCi. We plot their components (horizontal 
axes) Cg^ (black), c[^'* (dark gray) and Cq (light gray) as a function of the condition index 
(vertical axis). Only for Cq, obtained according to normalization used in the ISA, all the 
components associated with the conditions of the module are significantly larger than the 
others. 
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Figure 2: Finding the fixed point value of the signal-to-noise ratio, (a) The fixed point value 
of the signal-to-noise ratio PG\t) is found by solving eq. ( p4D (c.f. section We plot its 

right-hand-side RHS{pG,t) = [iV^™Vi - (^cT^ + AiVG)/iVg"^] as a function of pc for 
several values of the threshold t as indicated in the legend (setting Nq = 6000, Nq^^ = 60, 
Pc = 1). RHS{pG,t) depends on pc and t through the effective numbers NQ^\t, pc) and 
ANcit) (defined in eqs. (|32|) and (^)) that denote the expected number of genes inside and 
outside the module that passed the threshold. Each curve increases monotonically from 
zero to its maximal value pQ°'^{t). For pc S> t, the effective number iVg"^ approaches Nq^\ 
In this limit pQ"-^{t) depends on t only through AA^^, which goes to zero for t ^ 1. Thus 



PQ°'^(t) y Nq Pq — 1 asymptotically. According to eq. (|3^) the fixed-point solutions for 
the signal-to-noise ratio PQ\t) are given by pc = RHS{pG,t) and therefore correspond to 
the intersections (indicated by the big dots) of these curves with the diagonal (shown as a 
dashed line), (b) The solutions in (a) are plotted as a function of the threshold t. For a 
relatively small threshold [t < 2) Pg\pg, t) increases rapidly as a function of t, saturates to 
p-max t > 2 and suddenly falls off to zero at a certain threshold ttransi^ 6). This behavior 
can be understood from (a): For a low threshold the intersection of curves for RHS{pG,t) 
with the diagonal appears at small values of pg- For larger t the intersections occur in the 
saturated regime of RHS{pG,t), such that pg — p'^"'^{t). However, if t is too large the 
curves do not intersect with the diagonal and there is no solution, (c) iVg"^(t)/Ar^"^ (dark 
gray) as well as ANG{t)/ANG (light gray) and Q{t) = iV^'"^(t)/(iV^™^(t) + AiVG(t)) (black) 
are shown as a function of t. g{t) ^ 1 for 3 ^ t < 6, indicating the optimal regime for the 
threshold. 
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Figure 3: Properties of the fixed point value of the signal-to-noise ratio, (a) The fixed 
point value of tlie signal-to-noise ratio, pQ\t,pc), characterizes the separability between 
the gene score distributions for the genes inside and outside the single module (c.f. sec- 
tion ^ for details). The plot shows pc) as a function of both the threshold t and 
the (fixed) signal-to- noise ratio in the expression matrix pc- For very small thresholds 
Pa\t,pc) vanishes if pc is below some critical value p'^^^ ~ 1.3. However, increasing the 
threshold the iterations converge to a finite fixed-point, pG^{t,pc) > 0, even if pc < pc"^^ 
(but Pc > Pc^^ ^ 0.5). There is an optimal regime for the threshold t, where pQ^t^pc) is 
(near to) maximal. Within this regime pQ\t,pc) depends only weakly on t, so the conver- 
gence is robust with respect to the exact choice of the threshold. The size of this regime 
increases with pc- (b) The ratio r{t,pc) = {pQ\t,pc) — Pg\pc))/ Pg\^' Pc) characterizes 
the improvement in the identification of transcription modules that is achieved by the 
application of the threshold function. {p%\pc) denotes the fixed-point value of the signal- 
to-noise ratio in the absence of a threshold, and r{t,pc) is set to zero for pQ\t,pc) = 0.) 
We show r{t,pc) as function of t and pc- The regime where pc < p^"* is subdivided into a 
white region {r{t,pc) = 1), where the iterative scheme only converges to a positive value, 
P^ci^iPc) > 0, due to the threshold and a black area {r{t,pc) = 0), where the iterative 
schemes does not converge to a positive value implying that the module cannot be iden- 
tified in this regime. Note that also for pc > Pc^^, where the iterative schemes converges 
to a positive value even without a threshold, there exists a large region in the parameter 
space of t and pc (the light gray area for r{t,pc)), where pQ\t,pc) is significantly larger 
than PG\t). 
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Figure 4: Evolution of the score distributions under the ISA. (a) The distributions of the 
gene scores of 100 input sets which serve as seeds for the iterations of our algorithm: The 
distribution of the genes that are not part of the TM (hght gray) has a vanishing mean 
value. The genes belonging to the module (black) are distributed with a positive mean 
value. Note that the two initial distributions cannot be distinguished from each other 
accurately, (b— c) Evolution of the two distributions under the iterative scheme defined by 
eq. ([T7|). (b) Without applying a threshold, the mean of the signal-distribution decreases 
in each iteration and the separability of the two distributions does not improve, (c) When 
a threshold (t = 1) is applied the mean of the signal distribution increases in each step 
until it saturates at a value where the two distributions are well separated, (d) The signal- 
to-noise ratio Pq-' characterizes the separability between the gene score distributions for 
the genes within and outside the module (c.f. section ^ for details). We plot ^ as a 
function of the number of iterations n. The evolution of Pq^ under the iterations scheme 
with (squares) and without (circles) a threshold obtained from the numerical simulation 
(gray) are in good agreement with the theoretical predictions (black) according to eq. (p5|). 
We used Nq = 1700, N^^ = 40 and pc = I for this figure. 
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(a) expression data (b) converged gene sets 




Figure 5: Identification of overlapping modules. An in-silico expression matrix describing 
500 genes under 100 experimental conditions was generated according to the model intro- 
duced in the text. The data corresponds to two overlapping transcription modules Mi and 
M2, each containing 250 genes and 50 conditions, (a) The expression matrix is shown for 
comparison on the left of each row. (b— c) Using this matrix we applied the ISA to 1000 
input sets composed of randomly chosen genes. Iterations were performed using different 
choices of the threshold to- (b) The boxes in each row represent 10 of the resulting con- 
verged gene sets, that were obtained for to as indicated on the left. Each box i = 1, 10 
is composed of 500 lines that specify the genes which appear in the corresponding fixed 
point. Genes that belong to the converged set are represented by a dark gray line, while 
the remaining genes are shown in light gray. For to — —2 the output sets contain all the 
genes, to — —1 yields output sets containing the genes that are associated with either of 
the two modules, for to — there are two types of output sets, comprising either the genes 
of Ml or of M2, for ^ 1 all the output set contain only those genes that belong to both 
modules and for tc — 2 the output sets are essentially empty, (c) The number of sets 
that converged (within 95% accuracy) to Mi[jMi (solid), Mi (dotted), M2 (dashed) or 
Ml n Ml (dash-dotted) are plotted as a function of to- Scanning over different thresholds 
reveals the modular structure of the expression data (Mi U Mi — > Mi, M2 — > Mi fl Mi). 
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Figure 6: Module identification from noisy expression data. In-silico expression matrices 
for 1050 genes under 1000 conditions, corresponding to 25 non-overlapping transcription 
modules of different sizes, were generated according to the model described in the text. 
Noise from a uniform distribution was superimposed onto this expression data. The width 
(7 of this noise distribution was varied, simulating different levels of noise. We quantified 
the efficiency of different algorithms to retrieve the modules from the expression data 
as described in the text. We show the fraction of correctly identified genes for the ISA 
(circles), hierarchical clustering (squares) and SVD (triangles). For the ISA we also the 
fraction of correctly identified modules are indicated (asterisks). SVD is very sensitive to 
the addition of noise and fails to identify the modules accurately, even for a small level of 
noise. Clustering can handle a moderate amount of noise, but not as much as the ISA. 
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Figure 7: Module identification in the presence of combinatorial regulation. In-silico ex- 
pression matrices corresponding to 25 overlapping modules were generated according to a 
model that allows for combinatorial regulation (see text for details). The degree of overlap 
between the modules is specified by the average number of transcription factors involved 
in the regulation of each gene {utf)- Only for utf = 1 each gene is associated with ex- 
actly one transcription factor. For larger values of utf distinct modules share common 
genes. We applied the SVD, hierarchical clustering (see Ref. |^ for related results) and 
the ISA to the expression matrices generated for n^rj? = 1, 6 and evaluated the outputs 
as described in the text. The ISA could successfully identify all the transcription mod- 
ules even in the case of highly overlapping modules (asterisks), The fraction of correctly 
identified genes per module only decreases slightly as a function of utf (circles). In con- 
trast, for Utf > ^ the identification capabilities of clustering (squares/crosses) and SVD 
(triangles) rapidly decrease. This is because the clustering algorithm does not allow for 
multiple assignments of one gene to different modules and therefore usually captures only 
small, incomplete fractions of the overlapping modules. Similarly, if the expression matrix 
cannot be reorganized into block- diagonal shape due to the overlap between the modules, 
the eigenvectors identified by SVD fail to characterize the modules properly. 



35 




Figure 8: Modular organization of yeast expression data. The iterative signature algorithm 
was apphed to genome wide yeast expression data gathered by more than 1000 DNA-chip 
experiments, (a) The figure shows the identified modules at three different gene-thresholds 
tc = {1.8, 2.1, 2.4}. For each threshold the corresponding modules are displayed in a plane, 
such that their distance reflects their correlation with respect to conditions. Moving to a 
higher threshold, corresponding of modules are kept in the same position in each plane, 
while the "new" modules are placed such that their position reflects best their correlation 
with the other modules. The left-most plane corresponds to the lowest threshold (to = 1.8), 
where only flve flxed points exist. The corresponding modules can be associated with 
central functions of the yeast organism: protein synthesis, cell-cycle (Gl), mating, amino- 
acid biosynthesis and stress response. We use color coding to indicate which of the flxed 
points that emerge at higher thresholds are related to these flve central modules (i.e. they 
would convergence to the respective module at the lowest threshold), b) The pie charts 
show for the number of random input sets that converged to the respective flxed point. 
The color coding is as in (a). 
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